17 Genomics
aligning multiple sequences, degrees of kinship can be assigned on the basis of the
score, which has the form
$$\text{total score} = \text{score for aligned pairs} + \text{score for gaps}. \tag{17.1}$$
The score is, in effect, the relative likelihood that a pair of sequences is related.
It represents distance, together with the operations (mutations and introduction of
gaps) required to edit one sequence onto the other. Sequence alignment attempts to
maximize the number of matches while minimizing the number of mutations and gaps
required in the editing process. Unfortunately, the relative weights of the terms on the
right-hand side of (17.1) are arbitrary. The main approach to assigning weights to the
terms more objectively is to study many extant sequences from organisms one knows
from independent evidence to be related. In principle, under a given set of conditions
(e.g., a certain level of exposure to cosmic rays), a given mutation presumably has a
definite probability of occurrence; that is, the probability could be derived from
an objective set of data according to the frequentist interpretation. In practice, the
difficulties of doing so, and the possibility that such probabilities may be specific to
the sequence neighbouring the mutation, make this an unpromising approach.
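As a minimal sketch of how the two terms of (17.1) combine, the following Python function scores a pair of sequences that have already been aligned. The function name and the weights (match = 1, mismatch = -1, gap = -2) are assumptions chosen purely for illustration; as noted above, the relative weights are arbitrary.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned, equal-length strings; '-' marks a gap.

    Returns (total, pair_score, gap_score), mirroring the two terms of (17.1).
    The default weights are illustrative assumptions, not recommended values.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pair_score = 0
    gap_score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be gapped in both sequences")
        if x == "-" or y == "-":
            gap_score += gap        # one gap position in either sequence
        elif x == y:
            pair_score += match     # exact match
        else:
            pair_score += mismatch  # substitution (mutation)
    return pair_score + gap_score, pair_score, gap_score


total, pairs, gaps = alignment_score("GATTACA", "GA-TCCA")
print(total, pairs, gaps)  # 2 4 -2
```

Changing the ratio of `gap` to `mismatch` changes which of several candidate alignments scores highest, which is why the choice of weights matters.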
While with DNA sequences, a nucleotide is—at least to a first approximation—
either matched or not, with polypeptides a substitution might be sufficiently close
chemically so as to be functionally neutral. Hence, if alignments are carried out at
the level of amino acids, exact matches and substitutions are dealt with by compiling
an empirical table, based on chemical or biological knowledge or both, of degrees of
equivalence.¹⁷ There is no uniquely optimal table. To construct one, a good starting
point is the table of amino acids (Table 15.6). Isoleucine should have about the same
score for substitution by leucine as for an exact match and so forth; substitution of
a polar for an apolar group or lysine for glutamic acid (say) would be given low or
negative scores. The biological approach is to look at the frequencies of the different
substitutions in pairs of proteins that can be considered to be functionally equivalent
from independent evidence (e.g., two enzymes that catalyse the same reaction).
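The chemical approach described above might be sketched as a small symmetric lookup table. The scores below are hypothetical values invented for illustration (they are not taken from any published matrix), and the helper names are likewise assumptions. They encode only the qualitative rules in the text: isoleucine for leucine scores almost like an exact match, whereas lysine for glutamic acid or polar for apolar scores negatively.

```python
# Hypothetical scores for four residues (I, L, K, E), for illustration only.
TOY_SCORES = {
    ("I", "I"): 4, ("L", "L"): 4, ("K", "K"): 5, ("E", "E"): 5,
    ("I", "L"): 3,   # chemically similar: nearly as good as an exact match
    ("K", "E"): -3,  # opposite charges
    ("I", "K"): -3, ("I", "E"): -3,  # apolar replaced by charged
    ("L", "K"): -3, ("L", "E"): -3,
}


def substitution_score(x, y, table=TOY_SCORES):
    """Look up the score for a residue pair, treating the table as symmetric."""
    if (x, y) in table:
        return table[(x, y)]
    return table[(y, x)]


def score_ungapped(a, b, table=TOY_SCORES):
    """Sum pair scores over two equal-length sequences aligned without gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(substitution_score(x, y, table) for x, y in zip(a, b))


print(score_ungapped("ILKE", "LLKK"))  # 3 + 4 + 5 - 3 = 9
```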
In essence, the entries in a scoring matrix are numbers related to the probability of
a residue occurring in an alignment. Typically, they are calculated as (the logarithm
of) the probability of the “meaningful” occurrence of a pair of residues divided by
the probability of random occurrence. Probabilities of “meaningful” occurrences are
derived from actual alignments “known to be valid”. The inherent circularity of this
procedure gives it a temporary and provisional air.
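The log-odds calculation just described can be sketched as follows: the score for a pair is the logarithm of its observed frequency in trusted alignments divided by the frequency expected if residues were paired at random. The function name, the use of base-2 logarithms, and the treatment of (x, y) and (y, x) as the same unordered pair are assumptions made for this sketch, and the input pairs are invented toy data.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Compute log2(observed / expected) for each unordered residue pair.

    aligned_pairs: iterable of (x, y) residue pairs taken from alignments
    assumed to be valid.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for x, y in aligned_pairs:
        pair_counts[tuple(sorted((x, y)))] += 1  # unordered pair
        residue_counts[x] += 1
        residue_counts[y] += 1
    n_pairs = sum(pair_counts.values())
    n_residues = sum(residue_counts.values())

    scores = {}
    for (x, y), count in pair_counts.items():
        observed = count / n_pairs
        qx = residue_counts[x] / n_residues
        qy = residue_counts[y] / n_residues
        # Random pairing: q_x * q_y, doubled for x != y because the pair
        # can arise in either order.
        expected = qx * qy if x == y else 2 * qx * qy
        scores[(x, y)] = math.log2(observed / expected)
    return scores


# Invented toy data: A and S mostly pair with themselves.
pairs = [("A", "A")] * 6 + [("A", "S")] * 2 + [("S", "S")] * 2
for pair, score in sorted(log_odds_scores(pairs).items()):
    print(pair, round(score, 2))
```

Pairs seen more often than chance would predict receive positive scores and pairs seen less often receive negative ones. The circularity noted above enters through `aligned_pairs`, which must come from alignments already accepted as valid.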
In the case of gaps, the (negative) score might be a single value per gap or could
have two parameters: one for starting a gap, and another, multiplied by the gap length,
for continuing it (called an affine gap cost). This takes some slight account of possible
correlations in the history of changes presumed to have been responsible for causing
the divergence in sequences. The scoring of substitutions considers each mutation to
be an independent event, however.
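A short sketch of the affine gap cost described above, taking the wording literally: one parameter for starting a gap plus another multiplied by the gap length. The parameter values (-10 and -1) and the function names are illustrative assumptions; some programs instead multiply the extension term by the gap length minus one.

```python
from itertools import groupby


def affine_gap_score(length, gap_open=-10, gap_extend=-1):
    """Score of a single gap: gap_open + gap_extend * length."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * length


def total_gap_score(aligned, gap_open=-10, gap_extend=-1):
    """Sum affine scores over every run of '-' in one aligned sequence."""
    return sum(
        affine_gap_score(len(list(run)), gap_open, gap_extend)
        for symbol, run in groupby(aligned)
        if symbol == "-"
    )


print(affine_gap_score(4))           # one gap of length 4: -14
print(total_gap_score("AC---GT-A"))  # gaps of length 3 and 1: -24
```

With these values, four gap positions cost less as a single gap (-14) than as two separate gaps (-24), which is how the scheme takes some account of correlated changes.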
¹⁷ For example, BLOSUM50, a 20 × 20 score matrix (histidine scores 10 if replacing histidine,
glutamine 0, alanine −3, and so on). The diagonal terms are not equal.